Capstone Project - Tan Kelvin (TP063098)

ParlAI Dialogue Safety Model with Emoticons and Internet Slangs Translation

Research Questions

  1. What are the limitations faced by current state of the art natural language processing tools in handling toxic comments?
  2. How could a safer dialogue utility be designed and implemented?
  3. What are the suitable methods and metrics in evaluating the safety performance of a dialogue utility?

1.0 Sampling

1.1 Wikipedia Toxic Comments

Obtained from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

This dataset was already used in Dinan et al. (2019) and Xu et al. (2021) for hate speech classification with the ParlAI dialogue safety model. In fact, it was used to build the model.

Thus, to prevent overfitting, it will not be used in the modifying, modelling and assessment stages.

Joining the Wikipedia test dataset with its labels.

Dinan et al. (2019) regrouped all six classes in the Wikipedia dataset into a single toxic class. The train dataset contains only 0s and 1s.

In the test dataset, there are -1s, 0s and 1s.

Drop rows with -1s from the Wikipedia test dataset.

Joining the train and test Wikipedia datasets as wtc_df.
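The joining, filtering and concatenation steps above can be sketched with pandas. The column names follow the Kaggle release of the Jigsaw dataset; the function name is illustrative:

```python
import pandas as pd

# The six label columns in the Jigsaw / Wikipedia Toxic Comments release.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def build_wtc(train, test, test_labels):
    # Attach the separately released labels to the test comments on "id".
    test = test.merge(test_labels, on="id")
    # Rows scored -1 were never annotated; drop them.
    test = test[(test[LABELS] != -1).all(axis=1)]
    # Stack train and test into a single frame, as done for wtc_df.
    return pd.concat([train, test], ignore_index=True)
```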

1.2 Davidson Dataset

Obtained from https://github.com/t-davidson/hate-speech-and-offensive-language

Emojis are saved in Unicode form, as seen below, at the end of the tweets.

1.3 Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior

Obtained by emailing the author at https://github.com/ENCASEH2020/hatespeech-twitter

The labels are stored in the same column as the tweets, following a tab at the end of each tweet.

Using regex, extract and remove the labels from the tweets.

The number assigned to each label represents how many annotators agreed that it is said label.

Drop any rows without a label.
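The extraction step can be sketched as below. The assumed layout is a single label word after the trailing tab; the function name and exact label format are illustrative:

```python
import re

# Assumed layout: "<tweet text>\t<label>" — one label word after a tab.
LABEL_RE = re.compile(r"\t\s*(\w+)\s*$")

def extract_label(raw):
    m = LABEL_RE.search(raw)
    if m is None:
        # No trailing label — such rows are dropped later.
        return raw, None
    return raw[: m.start()].rstrip(), m.group(1)
```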

1.4 ChatEval Twitter

Obtained from https://chateval.org/

This dataset has no toxicity labels, because it is used for chatbot research.

2.0 Exploring

2.1 Statistical Exploration

Although the exploring stage is mostly concerned with examining what is in the data, with modifications left to the next stage, some modifications are done here for efficiency. For example, since the numbers of emojis and slangs must be computed anyway, new columns with the converted text are created while counting them.

2.1.1 Counting Number of Words

2.1.2 Counting Lexical Diversity
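The two counts above can be computed with a simple tokenizer. Whitespace splitting is an assumption here; the notebook may use a different tokenizer:

```python
def word_count(text):
    # Number of whitespace-separated tokens.
    return len(text.split())

def lexical_diversity(text):
    # Ratio of unique tokens to total tokens; 1.0 means no word repeats.
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```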

2.1.3 Counting and Converting Emojis

2.1.4 Counting and Converting Internet slangs

As suggested by Bhattacharyya (2019) and Silkej (2020), the detection and translation of Internet slangs can be done using the list from internetslangs.com.
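A stdlib-only sketch of both conversions: the slang dictionary below is a tiny illustrative excerpt standing in for the internetslangs.com list, and emojis are rendered via their Unicode names (the notebook may instead use a dedicated emoji library):

```python
import re
import unicodedata

# Tiny illustrative excerpt; the full mapping comes from internetslangs.com.
SLANG = {"lol": "laughing out loud", "omg": "oh my god", "idk": "i do not know"}

def convert_slang(text):
    # Replace each word that appears in the slang dictionary with its expansion.
    return re.sub(r"\b\w+\b",
                  lambda m: SLANG.get(m.group(0).lower(), m.group(0)), text)

def convert_emojis(text):
    # Replace each emoji character with its lowercase Unicode name.
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":  # "Symbol, other" covers most emojis
            out.append(unicodedata.name(ch, "").lower())
        else:
            out.append(ch)
    return "".join(out)
```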

2.1.5 Counting Total Number of Emojis and Slangs

2.2 Graphical Exploration

Due to some issues in AWS with word cloud, matplotlib and ParlAI, the dataframes are saved as CSV files so that work can continue on a local PC.

For visualization and ease of navigation, mini dashboards are created using the param and panel libraries.

2.2.1 Wikipedia Toxic Comments Dashboards

Texts in the Wikipedia dataset are quite lengthy, with some surpassing 1,000 words. Lexical diversity is high. As seen on the x-axis, this dataset has more slangs than emojis.

The toxic class has the highest count, followed by obscene and insult.

2.2.2 Davidson Dataset Dashboard

The tweets in the Davidson dataset are no longer than 100 words and have very high lexical diversity. There are far more slangs than emojis.

2.2.3 Twitter Abusive Dataset Dashboard

In the Twitter abusive dataset, there are more outliers in number of words and lexical diversity than in the Davidson dataset. There are also more slangs than emojis.

2.2.4 ChatEval Twitter Dashboard

This dataset has the lowest number of words, slangs and emojis. There are also more slangs than emojis.

2.2.5 Word Clouds

A lot of the words in Wikipedia dataset are related to forum and editing.

There are lots of insults in the Davidson dataset, many of which are targeted at women and Black people.

There are a lot of links in the Twitter abusive dataset.

The ChatEval dataset was most likely collected from American tweets during an election, as seen in the high number of mentions of Trump and Hillary.

3.0 Modifying

Because some modifications were already done during data exploration, not much modification is needed here.

As explained at the sampling stage, the Wikipedia Toxic Comments dataset will not be used in the modifying, modelling and assessment stages, since it was used to build the ParlAI Dialogue Safety model and may cause overfitting.

3.1 Davidson Dataset

Creating the deslang_demoji column, in which the original tweets have both emojis and slangs converted. URLs are removed from the original tweets.

There are three classes: hate speech, offensive language and neither. Each has its own column, holding the number of annotators who agreed the tweet belongs to that class. The class column shows the class with the most annotator votes, where 2 means neither. Thus, if class is 2 the tweet is not toxic and is set to 0; otherwise it is counted as toxic and set to 1.
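The binarisation described above, as a small helper (the function name is illustrative):

```python
def binarize_davidson(class_label):
    # Davidson classes: 0 = hate speech, 1 = offensive language, 2 = neither.
    # "Neither" maps to non-toxic (0); the other two count as toxic (1).
    return 0 if class_label == 2 else 1
```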

There are almost five times more toxic tweets than non-toxic ones.

3.2 Twitter Abusive Dataset

Creating the deslang_demoji column, in which the original tweets have both emojis and slangs converted. URLs are removed from the original tweets.

The labels were extracted from the Tweet at the sampling stage. However, the labels need to be cleaned.

Most of the tweets are considered normal, but this research will use binary classification.

The number of non-toxic Tweets is higher than that of the toxic ones.

3.3 ChatEval Twitter

Creating the deslang_demoji column, in which the original tweets have both emojis and slangs converted. URLs are removed from the original tweets.

4.0 Modelling

4.1 Dialogue Safety Classification

Viewing the data and its structure in the ParlAI Dialogue Safety model.

In order to run the Dialogue Safety Classification model on the datasets, they need to be converted into text files. To create these text files, the structure of the JSON file in Dialogue Safety is examined.

Use the same train-test ratio as in Xia et al. (2020).

4.1.1 Davidson dataset

The Dialogue Safety model will be run on the Davidson dataset four times: on the original tweets, tweets with emojis converted, tweets with slangs converted, and tweets with both converted. Thus, a total of 12 text files are generated, since each run requires train, test and validation files.
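One way to write each row in ParlAI's single-turn text format, assuming the `__ok__`/`__notok__` labels seen in the dialogue safety JSON (the exact field layout is inferred from that structure):

```python
def to_parlai_line(text, is_toxic):
    # Tabs and newlines are field/episode separators in the format, so strip them.
    clean = text.replace("\t", " ").replace("\n", " ")
    label = "__notok__" if is_toxic else "__ok__"
    return f"text:{clean}\tlabels:{label}\tepisode_done:True"
```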

To ensure that the splits are successful and readable by ParlAI, use the display_data command.

Running eval_model to evaluate the performance of the Dialogue Safety model on all four variants of the Davidson dataset.

4.1.2 Twitter Abusive Dataset

Use the same train-test split as found in Xu et al. (2021) on Twitter Abusive Dataset.

Create text files accordingly. With four versions, each split into train, test and validation, a total of 12 text files are created.

To ensure that the splits are successful and readable by ParlAI, use the display_data command.

Running eval_model to evaluate the performance of the Dialogue Safety model on all four variants of the Twitter Abusive dataset.

Saving the changes and progress to the datasets as CSV files.

4.2 Chatbot

Import the baseline human responses for ChatEval Twitter

Join all the tweets into a single string with linebreaks, then append "[EXIT]" to the end of the string.

Use the subprocess library to automate communication with the chatbot model using the combined string.

For each model, there is no difference between its responses to the original tweets and its responses to the tweets with emoticons and slangs converted.

Add the results to the chateval dataframe

At first glance, the Empathetic model seems to offer better responses.

For tweets classified as unsafe by the Dialogue Safety model, results from the Empathetic model will be used.

In addition, responses by the Twitter model that contain 'sorry' will also be replaced, since such responses tend to lack content.

Responses by the Twitter model that contain 'understand' will likewise be replaced for the same reason.
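The replacement rules above can be combined into one selector (the function name is illustrative):

```python
def pick_response(twitter_resp, empathetic_resp, is_unsafe):
    # Fall back to the Empathetic model when the Dialogue Safety model flags
    # the tweet, or when the Twitter model's reply is a contentless apology.
    lowered = twitter_resp.lower()
    if is_unsafe or "sorry" in lowered or "understand" in lowered:
        return empathetic_resp
    return twitter_resp
```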

The result column is created.

Saving the result as a .txt file for submission to ChatEval

5.0 Assessment

5.1 Dialogue safety classification

5.2 Chatbot Evaluation

Unfortunately, the ChatEval platform was having issues at the time of writing and the human evaluation could not be obtained in time. Therefore, the responses of the human baseline, the JHU Twitter model and the proposed model are converted into a JSON file, which is then used to generate a Google Form so that the human evaluation can be done by three APU data science students.

The results of the Google Form were saved to a Google Sheet, then downloaded as an Excel file and imported here.

Drop the timestamp column

Transpose the results dataframe

Rename the columns

Because Google Forms shuffled the options to reduce bias, the JSON file used to create the form is imported to tally which model was selected in the form responses.

Create three new columns to count the number of times each of the three models is selected, initialized to 0.

Since some formatting issues might have occurred while importing to and exporting from Google Forms, use difflib to get the closest match. Then the counting can be done.
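The fuzzy matching step, using difflib's `get_close_matches`; the candidate mapping and the cutoff value are assumptions:

```python
import difflib

def closest_option(form_response, model_options):
    # model_options maps each model name to the exact response string it gave.
    # Return the model whose response best matches the (possibly reformatted)
    # option selected in the Google Form.
    matches = difflib.get_close_matches(form_response,
                                        list(model_options.values()),
                                        n=1, cutoff=0.0)
    best = matches[0]
    for model, resp in model_options.items():
        if resp == best:
            return model
```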

Use sum to create a new dataframe that aggregates all the value counts

Convert the aggregation into a dataframe

Rename columns

Plot bar chart